Run tests with miri #899

swfsql · 2023-12-05T13:02:37Z

Temporary patch on crossbeam.
- Note that crossbeam is an indirect dependency and they are still to release a version that passes on this miri check. So to temporarily get this out of the way, you can patch crossbeam on the workspace dfdx/Cargo.toml.
- Wait for crossbeam to make a new release and patch to that.
- Then try to update dependencies so the patch is no longer needed.
Check all test pass by some feature combinations.
- std,fast-alloc,cpu,nightly,safetensors
- std,cpu,nightly,safetensors
- no-std,fast-alloc,nightly
- no-std,nightly
Reduce test sizes so a miri pass is practical

To run a particular test, add it's rust path as a filter at the end of the miri run command:

# cargo clean (run once when `cargo miri` is not used)
cargo clean

# run all tests (note: very slow and should take hours, assuming there are no errors)
MIRIFLAGS='-Zmiri-strict-provenance -Zmiri-symbolic-alignment-check -Zmiri-retag-fields -Zmiri-disable-isolation -Zmiri-tree-borrows' \
cargo miri nextest run --test-threads 1 --no-default-features --features std,fast-alloc,cpu,nightly,safetensors

Miri UB errors

None found, both under dfdx package (74 tests) and under dfdx-core(293 tests).

Memory/Thread leaks

nn::layers::add_into::tests::longer_network
- Thread-leak error, disabling rayon on gemm for matmul seems to fix it. Commit: c6e0642
- Investigate why the leak happened, try to re-enable rayon usage.
tensor::cpu::tests::test_reuse_allocation_on_clone_tensor
- Memory-leak error, deallocating the cached allocations on TensorCache drop seems to fix it. Commit: ae777d4
- Avoid panics during drops? Instead of panicking, can emit error message warning about possible memory leaks.
None other found, both under dfdx package (74 tests) and under dfdx-core(293 tests).

Smaller tensors

Some tests have been reduced in tensor size so miri runs faster. Those changes are specific to the following commits:

The tensor reduction is quite optional, they only makes it simpler to run miri "on everything". But if this change ends up persisting, this wiki page should be updated. Finally, Idk how mha/transformer work, so I can't tell if the reduction shouldn't happen. That is, each test reduction should be reviewed to ack their reduction doesn't lose anything that should be tested.

Note that Miri depends on the test suite to find something, so having a bigger test suite may help. This is to say that a Miri check could be re-run in the future.
I haven't verified the cpu test coverage over the cpu source code, but a good way to go would be to have a high coverage (every relevant line of code getting "tested"). If more assurance would be desired, there could be more tests aiming at different characteristics of the source code, such as those related to branch coverage etc.

Remove ftz

Avoid ci errors

Note that crossbeam is an indirect dependency and they are still to release a version that [passes](crossbeam-rs/crossbeam#996) on this miri check. So to temporarily get this out of the way, you can patch crossbeam on the workspace `dfdx/Cargo.toml`.

Thread main finished before other threads were still active. The only thing related to threads were the gemm usage of rayon, and disabling rayon usage for matmul seems to be sufficient to fix this.

- Moved the impl of `Cache::try_empty_cache()` to `TensorCache::clear()`. - This can be invoked both by `Cache::try_empty_cache()` and by `drop(TensorCache)`. - Moved the device cache ptr deallocation to `BytesPtr`, `CudaBytesPtr` (newtype over `CUdeviceptr`) and `Buffer`. - This is abstracted by the `CachePtr` trait. - Can be called by `TensorCache::clear()`. - This method may require some "extra" device information, such as in the cuda case. That information is held by `TensorCache`.

swfsql · 2024-03-01T17:40:26Z

I'll be closing this as this was enough to indicate robustness on the library (at least in my opinion!), but nevertheless this can still be used as a reference whenever someone chooses to do a miri trial.

rainiwu and others added 7 commits January 26, 2024 00:29

remove deprecated ftz intrinsics

5c532ec

suppress spurious cargo clippy warning

fb91f13

Merge pull request #1 from rainiwu/remove-ftz

24a8593

Remove ftz

avoid conv1d bound for cudnn

4e3f7c7

bump gemm

a8bc54c

clippy fix

557687c

Merge pull request #2 from swfsql/avoid-ci-errors

1175903

Avoid ci errors

swfsql force-pushed the issue-897 branch from ae777d4 to 5f37c82 Compare March 1, 2024 16:12

swfsql added 5 commits March 1, 2024 11:15

reduce tensor sizes for 'slow' tests

ea879a1

reduce tensor sizes for 'slow' tests

6b15ab7

patch crossbeam

44a41c0

Note that crossbeam is an indirect dependency and they are still to release a version that [passes](crossbeam-rs/crossbeam#996) on this miri check. So to temporarily get this out of the way, you can patch crossbeam on the workspace `dfdx/Cargo.toml`.

miri pass nn::layers::add_into::tests::longer_network

4251a77

Thread main finished before other threads were still active. The only thing related to threads were the gemm usage of rayon, and disabling rayon usage for matmul seems to be sufficient to fix this.

swfsql force-pushed the issue-897 branch from 5f37c82 to 8f43ce5 Compare March 1, 2024 16:15

swfsql closed this Mar 1, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Run tests with miri #899

Run tests with miri #899

swfsql commented Dec 5, 2023 •

edited

Loading

swfsql commented Mar 1, 2024

Run tests with miri #899

Run tests with miri #899

Conversation

swfsql commented Dec 5, 2023 • edited Loading

Miri UB errors

Memory/Thread leaks

Smaller tensors

swfsql commented Mar 1, 2024

swfsql commented Dec 5, 2023 •

edited

Loading